Finding nuggets in documents: A machine learning approach

Authors

Yi-fang Brook Wu

Quanzhi Li

Razvan Stefan Bot

Xin Chen

Abstract

However, many text mining applications have little natural language processing ability beyond simple keyword indexing, and as a result, too many textual elements (words) are included in the analysis. We argue that noun phrases are better suited to text mining as textual elements and could provide more discriminating power than single words. Discourse representation theory (Kamp, 1981) and the language learning of children (Snow & Ferguson, 1997) show that a document’s primary concepts are carried by noun phrases. Because the noun phrases in a document are not equally important, we propose using them as candidates and identifying keyphrases from among them. Document keyphrases are the most important topical phrases for a given document, and they address the main topics of that document. Our study proposes a Keyphrase Identification Program (KIP) to approach this problem by analyzing the composition of noun phrases.

Keyphrases provide semantic metadata that can characterize documents and produce an overview of a document’s content. They can be used in many text-mining-related applications. In automatic text summarization, applications can extract the sentences with more keyphrases or higher keyphrase scores. As document metadata, keyphrases let applications efficiently classify or cluster documents into different categories. They may be used to enrich the metadata of the results returned from a search engine. Some search engines also use keyphrases for interactive query refinement and as a way of browsing a collection. Last, but not least, keyphrases may be extracted from documents to construct a domain glossary or thesaurus. Previous studies of the various applications of keyphrases are presented in the next section.
Some documents, mostly scholarly papers, have a list of keyphrases provided by authors, but unfortunately, most documents do not have author-assigned keyphrases. Keyphrases can also be assigned manually by professional indexers. The indexers may choose phrases from the document text as keyphrases, or, more commonly, choose phrases from a predefined controlled vocabulary. However, manually assigning keyphrases to documents is costly and tedious, and the results [...]

Document keyphrases provide a concise summary of a document’s content, offering semantic metadata that summarizes the document. They can be used in many applications related to knowledge management and text mining, such as automatic text summarization, development of search engines, document clustering, document classification, thesaurus construction, and browsing interfaces. Because only a small portion of documents have keyphrases assigned by authors, and it is time-consuming and costly to manually assign keyphrases to documents, it is necessary to develop an algorithm to automatically generate keyphrases for documents. This paper describes a Keyphrase Identification Program (KIP), which extracts document keyphrases by using prior positive samples of human-identified phrases to assign weights to the candidate keyphrases. The logic of our algorithm is: the more keywords a candidate keyphrase contains and the more significant these keywords are, the more likely this candidate phrase is a keyphrase. KIP’s learning function can enrich the glossary database by automatically adding newly identified keyphrases to the database. KIP’s personalization feature lets users build a glossary database suited to their own area of interest. The evaluation results show that KIP performs better than the systems we compared it with and that the learning function is effective.
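The scoring logic stated in the abstract (more keywords, and more significant keywords, make a candidate more likely to be a keyphrase) can be illustrated with a minimal sketch. This is not KIP's actual weighting scheme, which the abstract does not specify. The function names, the frequency-based weights, and the whitespace tokenization are all assumptions made for illustration, and the candidates are assumed to be noun phrases that have already been extracted.

```python
from collections import defaultdict


def learn_keyword_weights(known_keyphrases):
    """Assumed weighting: a word's weight is its frequency across the
    human-identified keyphrases, scaled so the most frequent word gets 1.0."""
    counts = defaultdict(int)
    for phrase in known_keyphrases:
        for word in phrase.lower().split():
            counts[word] += 1
    top = max(counts.values(), default=1)
    return {word: count / top for word, count in counts.items()}


def score_candidate(phrase, weights):
    """Sum the weights of the known keywords in the candidate phrase.
    More keywords, and more significant ones, give a higher score."""
    return sum(weights.get(word, 0.0) for word in phrase.lower().split())


def rank_candidates(candidates, weights, top_n=3):
    """Return up to top_n candidates with a non-zero score, best first.
    Ties are broken alphabetically so the ordering is deterministic."""
    scored = [(score_candidate(p, weights), p) for p in candidates]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for score, phrase in scored[:top_n] if score > 0]


# Hypothetical glossary of prior positive samples.
glossary = ["text mining", "text summarization", "document clustering"]
weights = learn_keyword_weights(glossary)

# Hypothetical candidate noun phrases from a document.
candidates = ["text mining applications", "search engine", "document collection"]
ranked = rank_candidates(candidates, weights)
```

In this sketch, a learning step like the one the abstract describes would amount to appending newly accepted keyphrases to `glossary` and recomputing the weights.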

Similar resources

Answer Extraction for Definition Questions using Information Gain and Machine Learning

Extracting nuggets (pieces of an answer) is a very important process in question answering systems, especially in the case of definition questions. Although there are advances in nugget extraction, the problem is finding some general and flexible patterns that allow producing as many useful definition nuggets as possible. Nowadays, patterns are obtained in manual or automatic way and then these...

A Hybrid Method for Opinion finding Task (KUNLP at TREC 2008 Blog Track)

This paper presents an approach for the Opinion Finding task at TREC 2008 Blog Track. For the Ad-hoc Retrieval subtask, we adopt language model to retrieve relevant documents. For the Opinion Retrieval subtask, we propose a hybrid model of lexicon-based approach and machine learning approach for estimating and ranking the opinionated documents. For the Polarized Opinion Retrieval subtask, we em...

Extracting information nuggets from disaster-related messages in social media

Microblogging sites such as Twitter can play a vital role in spreading information during “natural” or man-made disasters. But the volume and velocity of tweets posted during crises today tend to be extremely high, making it hard for disaster-affected communities and professional emergency responders to process the information in a timely manner. Furthermore, posts tend to vary highly in terms ...

Reachability checking in complex and concurrent software systems using intelligent search methods

Software system verification is an efficient technique for ensuring the correctness of a software product, especially in safety-critical systems in which a small bug may have disastrous consequences. The goal of software verification is to ensure that the product fulfills the requirements. Studies show that the cost of finding and fixing errors in design time is less than finding and fixing the...

A Hybrid Machine Learning Method for Intrusion Detection

Data security is an important area of concern for every computer system owner. An intrusion detection system is a device or software application that monitors a network or systems for malicious activity or policy violations. Already various techniques of artificial intelligence have been used for intrusion detection. The main challenge in this area is the running speed of the available implemen...

Journal:

JASIST

Volume 57, issue

Pages -

Publication date: 2006
